Abstract. We formally study two methods for data sanitization that have been used extensively in 
the database community: k-anonymity and ℓ-diversity. We settle several open problems concerning the 
difficulty of applying these methods optimally, proving both positive and negative results: 

- 2-anonymity is in P. 

- The problem of partitioning the edges of a triangle-free graph into 4-stars (degree-three vertices) 
is NP-hard. This yields an alternative proof that 3-anonymity is NP-hard even when the database 
attributes are all binary. 

- 3-anonymity with only 27 attributes per record is MAX SNP-hard. 

- For databases with n rows, k-anonymity can be solved in O(4^n · poly(n)) time for all k > 1. 

- For databases with ℓ attributes, alphabet size c, and n rows, k-Anonymity can be solved in 
2^{O(k^2 (2c)^ℓ)} + O(nℓ) time. 

- 3-diversity with binary attributes is NP-hard, with one sensitive attribute. 

- 2-diversity with binary attributes is NP-hard, with three sensitive attributes. 

1 Introduction 

The topic of data sanitization has received enormous attention in recent years. The high-level idea is to 
release a database to the public in such a manner that two conflicting goals are achieved: (1) the data is 
useful to benign researchers who want to study trends and identify patterns in the data, and (2) the data 
is not useful to malicious parties who wish to compromise the privacy of individuals. Many different models 
for data sanitization have been proposed in the literature, and they can be roughly divided into two kinds: 
output perturbative models (e.g., [1,2]) and output abstraction models (e.g., [3-5]). In perturbative models, 
some or all of the output data is perturbed in a way that no longer corresponds precisely to the input data 
(the perturbation is typically taken to be a random variable with nice properties). This includes work which 
assumes interaction between the prospective data collector and the database, such as differential privacy. 
In abstraction models, some of the original data is suppressed or generalized (e.g., an age becomes an age 
range) in a way that preserves data integrity. The latter models are preferred in cases where data integrity 
is the highest priority, or when the data is simply non-numerical. 

In this work, we formally study two data abstraction models from the literature, and determine which 
cases of the problems are efficiently solvable. We study k-anonymity and ℓ-diversity. 



1.1 k-Anonymity 

The method of k-anonymization, introduced in [3,4], is a popular method in the database community for 
publicly releasing part of a database while protecting individual identities in that database. Formally 
speaking, an instance of the k-anonymity problem is a matrix (a.k.a. database) with n rows and m columns with 
entries drawn from an underlying alphabet. Intuitively, the rows correspond to individuals and the columns 
correspond to their various attributes. For hardness results, we study a special case called the suppression 
model, where the goal is to replace entries in the matrix with a special symbol * (called a 'star'), until each 
row is identical to at least k − 1 other rows. The intuition is that the information released does not explicitly 
identify any individual in the database, but rather identifies at worst a group of size k.¹ 

¹ This intuition can break down when combined with background knowledge [5]. However, our intent in this paper 
is not to critique the security/insecurity of these methods, but rather to understand their feasibility. 



A trivial way to k-anonymize a database is to suppress every entry (replacing all entries with *), but this 
renders the database useless. In order to maximize the utility of the database, one would like to suppress the 
fewest entries: this is the k-Anonymity problem with suppression. Meyerson and Williams [6] proved that 
in the most general case, this is a difficult task: k-Anonymity is NP-hard for k ≥ 3, provided that the size 
of the alphabet is Ω(n). Aggarwal et al. [7] improved this, showing that 3-Anonymity remains NP-hard 
even when the alphabet size is 3. Bonizzoni et al. [8] further improved the result to show that 3-Anonymity 
is APX-hard, even with a binary alphabet. They also showed that 4-Anonymity with a constant number of 
attributes per record is NP-hard. Two basic questions remain: 

1. How difficult is 3-anonymity with a small number of attributes per record? 

2. How difficult is the 2-anonymity problem? 

Addressing the two questions above, we discover both a positive and negative result. On the positive side, 
in Section 3 we present a polynomial time algorithm for 2-Anonymity, applying a result of Anshelevich 
and Karagiozova [9]: 

Theorem 1. 2-Anonymity is in P. 

The polynomial time algorithm works not only for the simple suppression model, but also for the most 
general version of k-anonymity, where for each attribute we are given a generalization hierarchy of possible 
ways to withhold data. 

In Section 4, we consider k-anonymity in databases where the number of attributes per record is constant. 
This setting seems to be the most relevant for practice: in a database of users, the number of attributes per 
user is often dwarfed by the number of users in the database. We find a surprisingly strong negative result. 

Theorem 2. 3-Anonymity with just 27 attributes per record is MAX SNP-hard. Therefore, 3-Anonymity 
does not have a polynomial time approximation scheme in this case, unless P = NP. 

The proof uses an alphabet with Ω(n) cardinality. This motivates the question: how efficiently can we 
solve k-anonymity with a small alphabet and constant number of attributes per record? Here we can prove 
a positive result, showing that when the number of attributes is small and the alphabet is constant, there 
are subexponential algorithms for optimal k-anonymity for every k > 1. 

Theorem 3. For every k > 1, an optimal k-anonymity solution can be computed in O(4^n · poly(n)) time, 
where n is the total number of rows in the database. 

Theorem 4. Let ℓ be the number of attributes in a database, let c be the size of its alphabet, and let n be 
the number of rows. Then k-Anonymity can be solved in 2^{O(k^2 (2c)^ℓ)} + O(nℓ) time. 

This improves on results in [10]. Theorem 4 implies that k-Anonymity is solvable in polynomial time 
whenever ℓ < (log log n)/log c and c < log n. Theorem 4 also implies that for c = n^{o(1)} and ℓ = O(1), k-anonymity 
is solvable in subexponential time. Therefore it is highly unlikely that we can tighten the unbounded alphabet 
constraint of Theorem 2, for otherwise all of NP has 2^{n^{o(1)}} time algorithms. 

In Section 5, we provide an alternative proof that Binary 3-Anonymity, the special case of the problem 
where all of the attributes are binary-valued, is NP-hard. This result is weaker than that of [8], who recently 
showed that Binary 3-Anonymity is APX-hard. However, our proof also shows that a certain edge partitioning 
problem is NP-complete, which to the best of our knowledge is new.² Let Edge Partition Into Triangles 
and 4-Stars be the problem of partitioning the edges of a given graph into 3-cliques (triangles) and 4-stars 
(graphs with three degree-1 nodes and one degree-3 node). 

Theorem 5. Edge Partition Into Triangles and 4-Stars is NP-complete. 

Theorem 5 implies that the Ternary 3-Anonymity hardness reduction given in [7] is sufficient to 
conclude that Binary 3-Anonymity is NP-hard. 

² Edge Partition Into Triangles is NP-Complete as is Edge Partition Into 4-Stars [11], but this does not 
imply that Edge Partition Into Triangles and 4-Stars is NP-Complete. 






1.2 ℓ-Diversity 

Finally, in Section 6 we consider the method of ℓ-diversity introduced in [5], which has also been well-studied. 
This method attempts to refine the notion of k-anonymity to protect against knowledge attacks on particular 
sensitive attributes. 

We will work with a simplified definition of ℓ-diversity that captures the essentials. Similar to k-anonymity, 
we think of an ℓ-diversity instance as a table (database) with m rows (records) and n columns (attributes). 
However, each attribute is also given a label q or s, inducing a partition of the attributes into two sets Q 
and S. Q is called the set of quasi-identifier attributes and S is called the set of sensitive attributes. 

Definition 1. A database D is said to be ℓ-diverse if for every row u_0 of D there are (at least) ℓ − 1 distinct 
rows u_1, ..., u_{ℓ−1} of D such that: 

1. ∀q ∈ Q, 0 ≤ i < j < ℓ we have u_i[q] = u_j[q] 

2. ∀s ∈ S, 0 ≤ i < j < ℓ we have u_i[s] ≠ u_j[s] 

Constraint 1 is essentially the same as k-anonymity. Any row must have at least k − 1 other rows whose 
(non-sensitive) attributes are identical. Intuitively, Constraint 2 prevents anyone from definitively learning 
any row's sensitive attribute; in the worst case, an individual's attribute can be narrowed down to a set of 
at least ℓ choices. Similar to k-anonymity with suppression, we allow stars to be introduced to achieve the 
two constraints. 
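
As an illustration of Definition 1, the following is a minimal brute-force sketch of an ℓ-diversity check. The function name is_l_diverse, the representation of rows as tuples, and the index lists quasi and sensitive are assumptions made for this example only; a star is treated as an ordinary symbol.

```python
from itertools import combinations

def is_l_diverse(db, quasi, sensitive, l):
    """Brute-force check of Definition 1 (hypothetical helper, small inputs only).

    db        -- list of rows (tuples of symbols)
    quasi     -- column indices of the quasi-identifier attributes Q
    sensitive -- column indices of the sensitive attributes S
    """
    for u0 in range(len(db)):
        # Constraint 1: partners must agree with u0 on every attribute in Q.
        partners = [i for i in range(len(db))
                    if i != u0 and all(db[i][q] == db[u0][q] for q in quasi)]
        found = False
        for others in combinations(partners, l - 1):
            group = (u0,) + others
            # Constraint 2: pairwise distinct values on every attribute in S.
            if all(len({db[i][s] for i in group}) == l for s in sensitive):
                found = True
                break
        if not found:
            return False
    return True

db = [("F", "30s", "flu"), ("F", "30s", "cold"), ("F", "30s", "flu")]
print(is_l_diverse(db, quasi=[0, 1], sensitive=[2], l=2))
print(is_l_diverse(db, quasi=[0, 1], sensitive=[2], l=3))
```

The check is exponential in ℓ and is only meant to make the two constraints concrete; it says nothing about how to choose which entries to suppress.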

We can show rather strong hardness results for ℓ-diversity. 

Theorem 6. Optimal 2-diversity with binary attributes and three sensitive attributes is NP-hard. 

Theorem 7. Optimal 3-diversity with binary attributes and one sensitive attribute is NP-hard. 

Independent of their applications in databases, k-anonymity and ℓ-diversity are also interesting from a 
theoretical viewpoint. They are natural combinatorial problems with a somewhat different character from 
other standard NP-hard problems. They are a kind of discrete partition task that has not been studied much: 
find a partition where each part is intended to "blend in the crowd." Such problems will only become more 
relevant in the future, and we believe the generic techniques developed in this paper should be useful in 
further analyzing these new partition problems. 

2 Preliminaries 

We use poly(n) to denote a quantity that is polynomial in n. 

Definition 2. Let n and m be positive integers. Let Σ be a finite set. A database with n rows (records) and 
m columns (attributes) is a matrix from Σ^{n×m}. The alphabet of the database is Σ. 

Definition 3. Let k be a positive integer. A database is said to be k-anonymous or k-anonymized if for 
every row r_i there exist at least k − 1 other rows identical to r_i. 

As mentioned earlier, there are two methods of achieving k-anonymity: suppression and generalization. 
In the suppression model, cells from the table are replaced with stars until the database is k-anonymous. 
Informally, the generalization model allows the entry of an individual cell to be replaced by a broader 
category. For example, one may change a numerical entry to a range, e.g. (Age: 26 → Age: [20-30]). A formal 
definition is given in Section 3.1. 

In our hardness results, we consider k-anonymity with suppression. Since suppression is a special case 
of generalization, the hardness results also apply to k-anonymity with generalization. Interestingly, our 
polynomial time 2-anonymity algorithm works under both models. 

Definition 4. Under the suppression model, the cost of a k-anonymous solution to a database is the number 
of stars introduced. 
We let Cost_k(D) denote the minimum cost of k-anonymizing database D. 
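
Definitions 3 and 4 can be made concrete with a short sketch. The helper names is_k_anonymous and suppression_cost and the use of strings as rows are assumptions for illustration; the code only verifies a given release and counts its stars, it does not search for an optimal one.

```python
from collections import Counter

STAR = "*"

def is_k_anonymous(db, k):
    """Definition 3: every row must occur at least k times in the table."""
    counts = Counter(tuple(row) for row in db)
    return all(c >= k for c in counts.values())

def suppression_cost(original, released):
    """Definition 4: number of stars introduced when turning original into released."""
    cost = 0
    for row_before, row_after in zip(original, released):
        for a, b in zip(row_before, row_after):
            if a != b:
                if b != STAR:
                    raise ValueError("only suppression with '*' is allowed")
                cost += 1
    return cost

original = ["010", "011", "110", "111"]
released = ["01*", "01*", "11*", "11*"]
print(is_k_anonymous(original, 2), is_k_anonymous(released, 2))
print(suppression_cost(original, released))
```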

For the proofs of hardness, we introduce a few graph theoretical notions. Recall K_n denotes the complete 
graph on n vertices; we use the word triangle to denote K_3. 

Definition 5. Let k > 1. A k-star is a simple graph with k − 1 edges, all of which are incident to a common 
vertex v. v is called the center of the k-star. The other k − 1 vertices are called the leaves of the k-star. 

We also need a particular type of graph which we call a 3-binary tree. All interior nodes of such a tree 
have degree three. 

Definition 6. Let d ∈ ℕ be given. A 3-binary tree of depth d is a complete tree of depth d where the root 
has three children and all other nodes have two children. 

For our inapproximability results, we need the notion of an L-reduction [12]. 

Definition 7. Let A and B be two optimization problems and let f : A → B be a polynomial time computable 
transformation. f is an L-reduction if there are positive constants α and β such that 

1. OPT(f(I)) ≤ α · OPT(I) 

2. For every solution of f(I) of cost c_2 we can in polynomial time find a solution of I with cost c_1 such that 

|OPT(I) − c_1| ≤ β · |OPT(f(I)) − c_2| 

3 Polynomial Time Algorithm for 2-Anonymity 

Because 3-anonymity is hard even for binary attributes, it is natural to wonder if 2-anonymity is also 
difficult. However it turns out that achieving optimal 2-anonymity is polynomial time solvable. The resulting 
algorithm is nontrivial and would require heavy machinery to implement. We rely on a special case of 
hypergraph matching called Simplex Matching, introduced in [9]. 

Definition 8. Simplex Matching: Given a hypergraph H = (V, E) with hyperedges of size 2 and 3 and a 
cost function c : E → ℕ such that 

1. (u, v, w) ∈ E(H) ⟹ (u, v), (v, w), (u, w) ∈ E(H) and 
2. c(u, v) + c(v, w) + c(u, w) ≤ 2 · c(u, v, w) 

find M ⊆ E such that for all v ∈ V there is a unique e ∈ M containing v, and Σ_{e∈M} c(e) is minimized. 

Anshelevich and Karagiozova gave a polynomial time algorithm to solve Simplex Matching. We show 
that 2-Anonymity can be efficiently reduced to a simplex matching. 

Reminder of Theorem 1. 2-Anonymity is in P. 

Proof. Given a database D with rows r_1, ..., r_n, let C_{i,j} denote the number of stars needed to make rows r_i 
and r_j identical. Similarly define C_{i,j,k} to be the number of stars needed to make r_i, r_j, r_k all identical. Observe that 
in a 2-anonymization, any group with more than three identical rows could simply be split into subgroups 
of size two or three without increasing the anonymization cost. Therefore we may assume (without loss of 
generality) that the optimal 2-anonymity solution partitions the rows into groups of size two or three. 
Construct a hypergraph H as follows: 

1. For every row r_i of D, add a vertex v_i. 

2. For every pair r_i, r_j, add the 2-edge {v_i, v_j} with cost c(v_i, v_j) = C_{i,j}. 

3. For every triple r_i, r_j, r_k, add the 3-edge {v_i, v_j, v_k} with cost c(v_i, v_j, v_k) = C_{i,j,k}. 
Thus H is a hypergraph with n vertices and O(n^3) edges. We claim that H meets the conditions of the 
simplex matching problem. The first condition is trivially met. Suppose we anonymize the pair of rows r_i, r_j 
with cost C_{i,j}, so both rows have (1/2) C_{i,j} stars when anonymized. Observe that if we decided to anonymize the 
group r_i, r_j, r_k, the number of stars introduced per row would not decrease. That is, for all i, j, k we have 

(1/2) C_{i,j} ≤ (1/3) C_{i,j,k}. 

By symmetry, we also have 

(1/2) C_{i,k} ≤ (1/3) C_{i,j,k} and (1/2) C_{j,k} ≤ (1/3) C_{i,j,k}. 

Adding the three inequalities together and multiplying by 2, 

C_{i,j} + C_{i,k} + C_{j,k} ≤ 2 · C_{i,j,k}. 

Therefore H is an instance of the simplex matching problem. 

Finally, observe that any simplex matching of H corresponds to a 2-anonymization of D with the same 
cost, and vice-versa. □ 
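
The construction in the proof can be sketched as follows. This is an illustrative sketch under the suppression model only: group_cost, simplex_instance and satisfies_simplex_condition are hypothetical names, rows are assumed to be equal-length strings, and no Simplex Matching solver is included (the algorithm of [9] is not reproduced here).

```python
from itertools import combinations

def group_cost(rows):
    """Stars needed to make all rows identical: every column in which the
    rows disagree must be starred in every row of the group."""
    differing = sum(1 for col in zip(*rows) if len(set(col)) > 1)
    return differing * len(rows)

def simplex_instance(db):
    """Build the 2-edge costs C_{i,j} and 3-edge costs C_{i,j,k} of H."""
    n = len(db)
    pair = {e: group_cost([db[i] for i in e]) for e in combinations(range(n), 2)}
    triple = {e: group_cost([db[i] for i in e]) for e in combinations(range(n), 3)}
    return pair, triple

def satisfies_simplex_condition(pair, triple):
    """Check C_{i,j} + C_{j,k} + C_{i,k} <= 2 * C_{i,j,k} for every triple."""
    return all(pair[(i, j)] + pair[(j, k)] + pair[(i, k)] <= 2 * c
               for (i, j, k), c in triple.items())

pair, triple = simplex_instance(["000", "001", "011", "111"])
print(pair[(0, 1)], triple[(0, 1, 2)], satisfies_simplex_condition(pair, triple))
```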



3.1 The general case 

The proof of Theorem 1 also carries over to the most general case of 2-Anonymity, where instead of only 
suppressing entries with stars, we have a generalization hierarchy of possible values to write to an entry. We 
give a simple definition of generalization hierarchy that captures the essential features described in [4]. 

Definition 9. Let Σ be an alphabet of attributes, and let Γ ⊇ Σ. A generalization hierarchy is a rooted tree 
T on |Γ| nodes with |Σ| leaves, where the vertices v in T are put in one-to-one correspondence with alphabet 
symbols a(v) ∈ Γ, the leaves are in one-to-one correspondence with the symbols of Σ, and all vertices v have 
a cost c(v) ∈ ℕ. The cost function satisfies the property that if u is the parent of v in T, then c(u) > c(v). 

The key property of a generalization hierarchy is that the cost function decreases as one moves from the 
root of T down to its leaves. Note the suppression model of k-Anonymity can be modeled with a trivial 
generalization hierarchy: we can take a star graph T where the center of the star has symbol * and cost 1, 
while the leaves (corresponding to the letters of Σ) have cost 0. For any generalization hierarchy T, one can 
define the k-Anonymity-T problem, where the goal is to replace some entries in a matrix from Σ^{n×m} with 
symbols in Γ − Σ such that (a) every row is identical to at least k − 1 other rows, (b) every symbol replaced 
is a descendant of the new alphabet symbol replacing it (in T), and (c) the total sum of costs associated with 
these new symbols is minimized. 
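
To illustrate how pairwise costs could be computed under a generalization hierarchy, here is a minimal sketch. It assumes the tree is given as a parent map (root mapped to None) and a cost map, both hypothetical representations chosen for this example. Because costs strictly increase toward the root, the cheapest symbol covering two entries is their lowest common ancestor.

```python
def ancestors(sym, parent):
    """Chain from sym up to the root, starting with sym itself."""
    chain = [sym]
    while parent[chain[-1]] is not None:
        chain.append(parent[chain[-1]])
    return chain

def cheapest_common_symbol(a, b, parent):
    """Lowest common ancestor of a and b in the hierarchy."""
    above_b = set(ancestors(b, parent))
    for s in ancestors(a, parent):
        if s in above_b:
            return s
    raise ValueError("symbols are not in the same tree")

def pair_cost(r1, r2, parent, cost):
    """Cost of making two rows identical; matching entries are left untouched."""
    total = 0
    for a, b in zip(r1, r2):
        if a != b:
            total += 2 * cost[cheapest_common_symbol(a, b, parent)]
    return total

# The suppression model as a star-shaped hierarchy.
parent = {"0": "*", "1": "*", "*": None}
cost = {"0": 0, "1": 0, "*": 1}
print(pair_cost("010", "011", parent, cost))
```

With the star-shaped hierarchy above, pair_cost agrees with the number of stars needed in the suppression model.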

Theorem 8. For every generalization hierarchy T, 2-Anonymity-T is in P. 

Proof. (Sketch) We define a hypergraph H just as in Theorem 1, but with new costs C_{i,j} and C_{i,j,k} reflecting 
the costs of a particular generalization hierarchy. One can still prove that the conditions for the simplex 
matching problem hold, using the fact that if u is any ancestor of v, then c(u) > c(v). This condition implies 
that if we have anonymized two rows r_i, r_j, adding a third row r_k to be anonymized cannot decrease the 
cost of anonymization per row. That is, the particular generalization symbols needed to make all three rows 
identical could only cost more than the symbols needed to anonymize the two rows originally. □ 

4 k-Anonymity With Few Attributes 

We now turn to studying the complexity of k-Anonymity with a constant number of attributes. First we 
show that for an unbounded alphabet, the 3-anonymity problem is still hard even with only 27 attributes. 
We use the following MAX SNP-hard problem in our proof. 
Definition 10. Max 3DM-3 (Maximum 3-Dimensional Matching With 3 Occurrences) 

Instance: A set M ⊆ W × X × Y of ordered triples where W, X and Y are disjoint sets. The number of 
occurrences in M of any element in W, X or Y is bounded by 3. Let C_{3DM}(M') denote the objective value 
of a feasible solution M'. 

Goal: Maximize C_{3DM}(M') over all M' ⊆ M such that no two elements of M' agree in any coordinate. 



Reminder of Theorem 2. 3-Anonymity with just 27 attributes per record is MAX SNP-hard. Therefore, 
3-Anonymity does not have a polynomial time approximation scheme in this case, unless P = NP. 

Proof. To show 3-anonymity is MAX SNP-hard, we show that there is an L-reduction from Max 3DM-3 to 
3-Anonymity with 27 Attributes [12], since it is known that Max 3DM-3 is MAX SNP-complete [13]. 
Given a Max 3DM-3 instance I = (M, W, X, Y), construct a 3-Anonymity instance D as follows: 

1. Define Σ = M ∪ W ∪ X ∪ Y, so that it contains a special symbol for each triple t ∈ M and each 
element r ∈ W ∪ X ∪ Y. 

2. Add a row to D corresponding to each element r ∈ W ∪ X ∪ Y, as follows. For r ∈ W ∪ X ∪ Y, let 
t_{r,1}, t_{r,2}, t_{r,3} ∈ M be the three triples of M which contain r (if there are fewer than three triples then 
simply introduce new symbols). 

- If r ∈ W then add the following row to D: 

t_{r,1} t_{r,1} t_{r,1} t_{r,1} t_{r,1} t_{r,1} t_{r,1} t_{r,1} t_{r,1} t_{r,2} t_{r,2} ... t_{r,3} 

That is, the row contains nine copies of t_{r,1}, nine copies of t_{r,2}, then nine copies of t_{r,3}. 

- If r ∈ X, then add the row: 

t_{r,1} t_{r,1} t_{r,1} t_{r,2} t_{r,2} t_{r,2} t_{r,3} t_{r,3} t_{r,3} t_{r,1} t_{r,1} ... t_{r,3} 

- If r ∈ Y, then add the row: 

t_{r,1} t_{r,2} t_{r,3} t_{r,1} t_{r,2} t_{r,3} t_{r,1} t_{r,2} t_{r,3} t_{r,1} t_{r,2} ... t_{r,3} 

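Suppose w_i ∈ W, x_j ∈ X, y_k ∈ Y are arbitrary. Then the corresponding three rows in the database have 
the form: 

w_i   t_{w_i,1}  t_{w_i,1}  t_{w_i,1}  t_{w_i,1}  t_{w_i,1}  t_{w_i,1}  t_{w_i,1}  t_{w_i,1}  t_{w_i,1}  ...  t_{w_i,3} 
x_j   t_{x_j,1}  t_{x_j,1}  t_{x_j,1}  t_{x_j,2}  t_{x_j,2}  t_{x_j,2}  t_{x_j,3}  t_{x_j,3}  t_{x_j,3}  ...  t_{x_j,3} 
y_k   t_{y_k,1}  t_{y_k,2}  t_{y_k,3}  t_{y_k,1}  t_{y_k,2}  t_{y_k,3}  t_{y_k,1}  t_{y_k,2}  t_{y_k,3}  ...  t_{y_k,3} 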



Observe that D has a total of 27n entries, where n = |X| + |W| + |Y|. Recall Cost_3(D) is the optimal 
number of stars needed to 3-anonymize D. It is useful to redefine 3-Anonymity as a maximization problem 
(where one maximizes the information released). Let P be a 3-anonymous solution to D that uses Cost(P) stars, and define 

C_{3anon}(P) = 1 − Cost(P)/(27n), 

so that OPT(D) = 1 − Cost_3(D)/(27n). 

Suppose D = {r_1, ..., r_n} is an instance of 3-Anonymity obtained from the above reduction. Three 
properties are immediate from the construction of D: 

1. For any x rows r_i, r_j, r_k, r_l, ..., where x ≥ 4, the cost of anonymizing these rows is 

C_{i,j,k,l,...} = 27x 

because there is no alphabet symbol that is used in all 4 rows. 

2. If {r_i, r_j, r_k} ∉ M then the cost of anonymizing the three corresponding rows is 

C_{i,j,k} = 3 · 27 = 81 

because there is no alphabet symbol that is used in all 3 rows. 

3. If {r_i, r_j, r_k} ∈ M then the cost of anonymizing the three corresponding rows is 

C_{i,j,k} = 3 · 26 = 78 

because the three rows match in exactly one of the 27 columns. 
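
The three properties can be checked mechanically on a toy instance. The sketch below is an assumption-laden illustration: build_rows and group_cost are hypothetical names, the 27 columns are indexed by the base-3 digits (a, b, c) so that W-rows read digit a, X-rows digit b and Y-rows digit c, and padding symbols stand in for the "new symbols" of step 2. Only elements that occur in some triple receive a row.

```python
from itertools import product

def build_rows(M):
    """Build the 27-attribute row of every element occurring in M, a list of
    triples (w, x, y) in which each element occurs at most three times."""
    occurrences = {}
    for t in M:
        for role, elem in zip("WXY", t):
            occurrences.setdefault((role, elem), []).append(t)
    rows = {}
    for (role, elem), triples in occurrences.items():
        # Pad with fresh symbols that appear in no other row.
        syms = triples + [("pad", role, elem, i) for i in range(3 - len(triples))]
        pos = "WXY".index(role)
        rows[(role, elem)] = [syms[digits[pos]]
                              for digits in product(range(3), repeat=3)]
    return rows

def group_cost(rows):
    """Stars needed to make the given rows identical."""
    differing = sum(1 for col in zip(*rows) if len(set(col)) > 1)
    return differing * len(rows)

M = [("w1", "x1", "y1"), ("w1", "x2", "y2"), ("w2", "x1", "y2")]
rows = build_rows(M)
in_M = [rows[("W", "w1")], rows[("X", "x1")], rows[("Y", "y1")]]
not_in_M = [rows[("W", "w2")], rows[("X", "x2")], rows[("Y", "y1")]]
print(group_cost(in_M), group_cost(not_in_M))
```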

These properties lead directly to the lemma: 

Lemma 1. There is a polynomial time mapping g from 3DM-3 feasible solutions to 3-anonymity feasible 
solutions, such that if M' ⊆ M is a 3DM-3 feasible solution then C_{3DM}(M') = 27 C_{3anon}(g(M')). 

The proof of Lemma 1 is given in Appendix A. 

It remains for us to show that the above reduction is in fact an L-reduction. Let I be a Max 3DM-3 
instance, with corresponding 3-Anonymity instance f(I), and set α = β = 27. Now by Lemma 1 

OPT(f(I)) = (1/27) OPT(I) ≤ α OPT(I) 

so that condition (1) of an L-reduction holds. Similarly, if we have a solution of f(I) of cost c_2, then again 
by Lemma 1 we can quickly compute a solution to I of cost c_1 = 27 c_2. Therefore, 

|OPT(I) − c_1| = |27 OPT(f(I)) − 27 c_2| = β |OPT(f(I)) − c_2| 

so that condition (2) also holds. □ 

To complement the above bad news, we now give an efficient algorithm for optimal k-anonymity when the 
number of attributes and the size of the alphabet are both small. Along the way, we also give an algorithm 
for the general k-anonymity problem that runs in roughly 4^n time (again n is the number of rows). 

A naive algorithm for k-anonymity would take an exorbitant amount of time, trying all possible partitions 
of n rows into groups with cardinality between k and 2k − 1. We can reduce this greatly using a 
divide-and-conquer recursion. 

Reminder of Theorem 3. For every k > 1, an optimal k-anonymity solution can be computed in 
O(4^n · poly(n)) time, where n is the total number of rows in the database. 

Proof. Interpret our k-anonymity instance S as a multiset of n vectors drawn from Σ^m. Define S_k = {T : 
T ⊆ S, |T| ∈ [n/2, n/2 + 2k]}. That is, S_k contains all multisubsets which have approximately n/2 elements. 
Then 

Cost_k(S) = min_{T ∈ S_k} [Cost_k(S − T) + Cost_k(T)], (1) 

where Cost_k(S) is the cost of the optimal k-anonymous solution for S. Equation (1) holds because (without 
loss of generality) any k-anonymized group of rows in a database has size at most 2k − 1, so we can always 
partition the k-anonymized groups of a database into two multisets where their cardinalities are in the 
interval [n/2 − 2k, n/2 + 2k]. 

Suppose we compute the optimal k-anonymity solution by evaluating equation (1) recursively, for all 
eligible multisubsets T. In the base case when |S| ∈ [k, 2k − 1], we make all rows in S identical and return 
that solution. 

We can simply enumerate all 2^n multisubsets of S to produce all possible T in equation (1). The time 
recurrence of the resulting algorithm is 

T(n) ≤ 2^{n+1} · T(n/2 + 2k) + 2^n. 

This recurrence solves to T(n) ≤ O(log n × 2^{log n} × 2^{n + n/2 + n/4 + ... + 1}) ≤ O(4^n · poly(n)) for constant k. □ 
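
The recursion of equation (1) can be sketched directly for tiny inputs. This is a minimal illustration, not the authors' implementation: optimal_cost and group_cost are hypothetical names, rows are assumed to be equal-length strings, and the sketch additionally restricts T so that both T and S − T keep at least k rows (otherwise one side could never be k-anonymized). Memoization is added for convenience and is not part of the analysis above.

```python
from functools import lru_cache
from itertools import combinations

def group_cost(rows):
    """Stars needed to make all rows of one group identical."""
    differing = sum(1 for col in zip(*rows) if len(set(col)) > 1)
    return differing * len(rows)

def optimal_cost(db, k):
    """Evaluate Cost_k by halving, as in equation (1). Exponential time."""
    @lru_cache(maxsize=None)
    def cost(idx):  # idx is a sorted tuple of row indices
        n = len(idx)
        if n < k:
            return float("inf")          # too few rows to form a group
        if n <= 2 * k - 1:
            return group_cost([db[i] for i in idx])  # base case: one group
        best = float("inf")
        low = (n + 1) // 2               # |T| >= n/2
        high = min(n - k, n // 2 + 2 * k)
        for size in range(low, high + 1):
            for T in combinations(idx, size):
                chosen = set(T)
                rest = tuple(i for i in idx if i not in chosen)
                best = min(best, cost(T) + cost(rest))
        return best
    return cost(tuple(range(len(db))))

print(optimal_cost(["00", "01", "10", "11"], 2))
print(optimal_cost(["000", "000", "111", "111", "110"], 2))
```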
Reminder of Theorem 4. Let ℓ be the number of attributes in a database, let c be the size of its alphabet, 
and let n be the number of rows. Then k-Anonymity can be solved in 2^{O(k^2 (2c)^ℓ)} + O(nℓ) time. 

If ℓ and c are constants then there are at most c^ℓ possible rows. To specify a group of k-anonymized 
rows we write G = ⟨r', t⟩ where t is the number of times the anonymized row r' occurs in the group. We 
can think of a k-anonymous solution as a partition of the rows into such anonymized groups. The following 
lemma will be useful for our algorithm. 

Lemma 2. Suppose that our database D contained at least k(2k − 1) × 2^ℓ copies of a row r. Then the optimal 
k-anonymity solution must contain a group containing just row r, i.e. G = ⟨r, t⟩ where t ≥ k. 

Proof. Suppose for contradiction that our database contains more than k(2k − 1) × 2^ℓ copies of row r, but 
that our optimal solution did not contain a group G = ⟨r, t⟩. Without loss of generality we can assume 
k ≤ t ≤ 2k − 1 for each group since larger groups could be divided into two groups without increasing the 
cost. Therefore, we must have at least k × 2^ℓ groups G = ⟨r', t⟩ which contain the row r. Notice that 
each attribute of r' either matches r or is a *. Hence, there are at most 2^ℓ possible values of r'. By the 
pigeonhole principle there must be at least k groups G_i = ⟨r', t_i⟩ containing r whose anonymized rows 
r' are all identical. Merge these groups into one big group G = ⟨r', Σ_{i=1}^k t_i⟩ at no extra cost. Each of the 
original k groups contained at least one copy of the row r so we can split G into two groups: G' = ⟨r, k⟩ 
and G'' = ⟨r', Σ_{i=1}^k t_i − k⟩ while saving at least k stars. Hence, our original solution was not optimal. 
Contradiction! □ 

For each row r we can define Index(r) to be a unique index between 0 and c^ℓ − 1 by interpreting r as 
an ℓ-digit number base c. Using Lemma 2, the following algorithm can be used to obtain a kernelization of an 
instance of k-Anonymity, in the sense of parameterized complexity [14]. 



Algorithm 1 k-anonymize a database D with small alphabet and few attributes 
Require: rowCount[i] = |{r ∈ D : Index(r) = i}| 
Require: c, ℓ small 
    T ← k(2k) × 2^ℓ 
    for Row r ∈ D do 
        i ← Index(r) 
        if rowCount[i] > T then 
            print "⟨r, k⟩" {By Lemma 2} 
            rowCount[i] ← rowCount[i] − k 
        end if 
    end for 
    {Now ∀i, rowCount[i] ≤ T so there are at most m ≤ k(2k)2^ℓ(c^ℓ) = k(2k)(2c)^ℓ rows remaining} 



Lemma 3. Algorithm 1 runs in O(nℓ) time on a database D and outputs a database D' with at most 
O(k^2 (2c)^ℓ) rows, with the property that an optimal k-anonymization for D' can be extended to an optimal 
k-anonymization for D in O(nℓ) time. 

That is, for the parameter k + c + ℓ, the k-anonymity problem is not only fixed parameter tractable, but can 
also be efficiently kernelized. 

Proof. (Sketch) By implementing rowCount as a hash table, each Index(r) and lookup operation takes O(ℓ) 
time. Hence, setup takes O(nℓ) time, as does the loop. By Lemma 2 there must be an optimal k-anonymity 
solution containing ⟨r, t⟩ with t ≥ k whenever r occurs at least k(2k − 1)2^ℓ times in D'. Therefore, if r 
occurs more than k(2k)2^ℓ ≥ k + k(2k − 1)2^ℓ times in D then there is an optimal k-anonymity solution which 
contains the groups ⟨r, k⟩ and ⟨r, t⟩, so adding back k copies of row r to D' does not change the optimal 
k-anonymization except for the extra ⟨r, k⟩ group. □ 
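
A minimal sketch of the kernelization step of Algorithm 1 follows. The function name kernelize and the use of a Counter keyed by the row itself (instead of the numeric Index(r)) are assumptions for illustration; rows must be hashable, for example strings or tuples.

```python
from collections import Counter

def kernelize(db, k, ell):
    """Peel off groups <r, k> while a row r occurs more than T = k(2k)2^ell times.

    Returns the reduced database D' and the list of emitted groups (r, k).
    """
    threshold = k * (2 * k) * 2 ** ell
    counts = Counter(db)
    emitted = []
    for row in counts:
        while counts[row] > threshold:
            emitted.append((row, k))   # by Lemma 2 this group is safe to fix
            counts[row] -= k
    reduced = [row for row, c in counts.items() for _ in range(c)]
    return reduced, emitted

reduced, emitted = kernelize(["0"] * 20 + ["1"] * 3, k=2, ell=1)
print(len(reduced), emitted)
```

In this toy run the threshold is 16, so two groups ⟨"0", 2⟩ are emitted and 19 rows remain.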
Proof of Theorem 4. By Lemma 3, Algorithm 1 takes an arbitrary k-anonymity instance D and reduces it to 
a new instance D' with at most m = k(2k)(2c)^ℓ rows in time O(nℓ). We can then apply Theorem 3 to k-anonymize 
D' in time O(4^m poly(m)). The total running time is 2^{O(k^2 (2c)^ℓ)} + O(nℓ). □ 

5 Hardness of 3-Anonymity With Binary Attributes 

In 2005, Aggarwal et al. [7] showed that 3-anonymity with a ternary alphabet is NP-hard. Their proof of 
hardness gives a reduction from Edge Partition Into Triangles, in which one is given a graph and is 
asked to determine if the edge set E can be partitioned into 3-sets such that each set corresponds to a copy 
of K_3. In particular, Aggarwal et al. first present a reduction from the problem of Edge Partition Into 
Triangles And 4-Stars³ into Binary 3-Anonymity. Then they introduce a third alphabet symbol to 
distinguish 4-stars from triangles in the reduction, concluding that a Ternary 3-Anonymity algorithm 
can be used to solve Edge Partition Into Triangles. 

We shall strengthen this result by directly proving that the Edge Partition Into Triangles And 
4-Stars problem is NP-Complete. In fact, we establish the hardness of edge partitioning into 4-stars on 
triangle-free graphs. Using the aforementioned reduction of Aggarwal et al., the hardness of Binary 
3-Anonymity follows from this result. 

Reminder of Theorem 5. Edge Partition Into 4-Stars is NP-Complete, even for triangle-free graphs. 

We describe the setup for Theorem 5 in the following paragraphs. The reduction will be from 1-in-3 
Sat, which is well-known to be NP-Complete [15]. Recall that in the 1-in-3 Sat problem, we are given a 
3-CNF formula φ and are asked if there is a satisfying assignment to φ with the property that exactly one 
literal in each clause is true. We call a yes-instance of the problem 1-in-3 satisfiable. Given a formula φ, the 
idea of our reduction is to create triangle-free graph gadgets (a gadget for each variable, and another type 
of gadget for each clause) and connect them in a (triangle-free) way such that φ is 1-in-3 satisfiable if and 
only if the resulting graph can be edge-partitioned into 4-stars. We first define a type of graph that shall be 
used to simulate the truth assignment of a variable in φ. 

Definition 11. Let d ∈ ℕ be given. The graph G_d is formed by taking two 3-binary trees of depth d, deleting 
a leaf from exactly three different parents in each tree, and adding three edges so that the parents of deleted 
leaves in one tree are matched with the parents of deleted leaves in the other tree. 

In a copy of G_d, we consider all edges adjacent to leaves to be shared edges, while all other edges are 
considered private. Intuitively, the shared edges are those that are shared with other gadgets in our final 
graph. We say that G contains a share-respecting copy of G_d if its vertex set can be partitioned into two sets 
S and T such that S is an induced copy of G_d, and all edges crossing the cut (S, T) are adjacent to shared 
edges in S. 

To distinguish between the two trees in a copy of G_d, they are arbitrarily designated as the top tree and 
bottom tree, respectively. 

The key property of the gadget G_d is given by the following claim, which says that (in a certain sense) 
the edges of G_d can be partitioned into 4-stars in precisely two ways. Figure 1(a) illustrates S_5, where the 
dashed edges are shared and the solid edge is private. It can be found in Appendix B. 

Lemma 4. Let G be a graph containing a share-respecting copy of G_d. Assuming there is an edge partition 
of G into 4-stars, exactly one of two cases must hold for that partition: 

1. All shared edges belonging to the top tree of G_d are contained in 4-stars with centers in G_d, while all 
shared edges belonging to the bottom tree are contained in 4-stars with centers not contained in G_d. 

2. All shared edges belonging to the bottom tree of G_d are contained in 4-stars with centers in G_d, while all 
shared edges belonging to the top tree are contained in 4-stars with centers not contained in G_d. 

³ This problem is: Given a graph G = (V, E), is it possible to partition the edge set E into 3-sets such that each 
3-set corresponds to either a copy of K_3 or a 4-star? 
In the first case of the claim, we say that the copy of G_d is true partitioned, and in the second case we 
say that G_d is false partitioned. Intuitively, each copy of G_d in our final graph will correspond to a variable 
in φ, and a true/false partition shall correspond to assigning that variable true/false. Lemma 4 is proved in 
Appendix D. 

We now define another type of graph that shall be used as gadgets to represent clauses in a given 1-in-3 
Sat formula. 

Definition 12. The graph S5 is a 5-star with one of its edges labeled private and the other three edges 
labeled shared. 

Figure 1(b) illustrates 5*5, where the dashed edges are shared and the solid edge is private. It can be 
found in Appendix B. 

Suppose a graph G contains a share-respecting copy of $S_5$, so that one node incident to the private edge 
of $S_5$ has degree one. Call this node v and its adjacent node u (the center of $S_5$). Then, any partition of G 
into 4-stars must contain a 4-star with u as its center, using the edge (u, v). But this 4-star must use exactly two of 
the shared edges in $S_5$. Therefore an edge-partition of G into 4-stars is possible if and only if exactly one of 
the shared edges in $S_5$ participates in a 4-star with a center that is outside of $S_5$. 
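The counting argument for the clause gadget can be checked by brute force on a tiny instance. The following is a minimal sketch, not part of the original construction: the function name `star_partitions` and the vertex names are illustrative assumptions, and the graph is assumed to be simple. It enumerates every partition of an edge list into 4-stars (three edges sharing a common center).

```python
from itertools import combinations

def star_partitions(edges):
    """Yield every partition of a simple graph's edges into 4-stars.

    A 4-star here is a set of three edges that share one common endpoint.
    Brute force: only intended for very small illustrative graphs.
    """
    edges = sorted(tuple(sorted(e)) for e in edges)
    if not edges:
        yield []
        return
    first, rest = edges[0], edges[1:]
    # The first edge must be grouped with two other edges sharing a center.
    for a, b in combinations(rest, 2):
        if set(first) & set(a) & set(b):
            remaining = [e for e in rest if e not in (a, b)]
            for tail in star_partitions(remaining):
                yield [(first, a, b)] + tail

# S_5 alone: center u, private edge (u, v), shared edges to a, b, c.
s5 = [("u", "v"), ("u", "a"), ("u", "b"), ("u", "c")]
print(len(list(star_partitions(s5))))

# Give leaf a two further edges, so that (u, a) can be covered by a star
# centered at a instead of at u.
extended = s5 + [("a", "p"), ("a", "q")]
print(list(star_partitions(extended)))
```

In the second graph the only partition uses the private edge together with exactly two shared edges at u, while the third shared edge is covered by a star that is not centered at u, which is the behavior the paragraph above describes.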

We are finally ready to prove Theorem 5. 
Proof of Theorem 5. Let a 1-in-3 SAT instance $\varphi$ be given with clauses $C_1, \ldots, C_m$ and variables $x_1, \ldots, x_n$. 
We wish to create a triangle-free graph $G_\varphi$ that can be edge-partitioned into 4-stars if and only if $\varphi$ is 1-in-3 
satisfiable. 

$G_\varphi$ is constructed as follows: 

- For each variable $x_i$, let $k_i$ denote the number of clauses that $x_i$ occurs in (or the number of clauses that 
$\bar{x}_i$ occurs in, whichever is greater). Let $d_i$ be the integer satisfying $3 \cdot 2^{d_i-2} < 3(k_i+1) \le 3 \cdot 2^{d_i-1}$. Add 
a copy of the graph $G_{d_i}$ to $G_\varphi$, calling it $A_i$. Note that $A_i$ has at least $3(k_i+1)$ leaves. 

- For each clause $C_i = (l_1 \vee l_2 \vee l_3)$, add three copies of $S_5$ to $G_\varphi$, calling them $B_{i,1}, B_{i,2}, B_{i,3}$. 

Join the shared edges of these subgraphs as follows: if the literal $l_j = x_k$ is in $C_i$, then merge one shared 
edge from each of $B_{i,1}, B_{i,2}, B_{i,3}$ with three shared edges from the top tree of $A_k$; otherwise, if $l_j = \bar{x}_k$ is 
in $C_i$, then merge a shared edge from each of $B_{i,1}, B_{i,2}, B_{i,3}$ with three shared edges from the bottom tree 
of $A_k$. As a heuristic, use a shared edge which is incident to another unused shared edge in $G_{d_k}$, whenever 
possible. 

Since $A_i$ has $3 \cdot 2^{d_i-1} \ge 3(k_i+1)$ leaves, there may remain some shared edges in some $A_i$ that have 
not been merged with shared edges from a copy of $S_5$. We deal with these extra shared edges as follows: Take 
three shared edges from different parents in the top tree of $A_i$, and merge their end vertices with a new 
vertex to form a 4-star. Repeat until all unused shared edges from the top are used and do the same for the 
shared edges on the bottom. Note that this is possible because we used three copies of $S_5$ for each clause; 
hence, shared edges from the top/bottom of each gadget are taken in multiples of three, and the number of 
leaves in every $A_i$ is a multiple of three. By the above heuristic for choosing unused shared edges we will 
never create a multi-edge. 

Clearly the above reduction can be done in polynomial time. Also note that by construction, $G_\varphi$ contains 
no triangles. We now argue that the formula $\varphi$ is 1-in-3 satisfiable if and only if $G_\varphi$ can be edge-partitioned 
into 4-stars. Supposing that $\varphi$ is satisfiable, partition each variable gadget according to its assignment in 
a given satisfying assignment. In particular, if a variable is set to true, then true-partition the edges in its 
corresponding variable gadget; otherwise, false-partition them. Each clause gadget can be partitioned into a 4-star, since exactly one of its 
shared edges is used. The remaining edges are already part of 4-stars by construction and can hence be 
partitioned. 

For the other direction, suppose that $G_\varphi$ can be partitioned into 4-stars. By Lemma 4, each copy of $G_d$ is 
either true or false partitioned. Now each clause gadget can be partitioned if and only if exactly one of its 
shared edges is used by a 4-star with a center in a variable gadget. By construction, this happens iff exactly 
one of the literals in the clause was assigned true in the partition for its variable gadget. Thus the partition 
defines a satisfying assignment for $\varphi$. 

Footnote: A copy of $S_5$ is share-respecting if and only if the center vertex has degree 4 and the leaf incident to the private edge 
has degree 1. 

Finally, note that the graph $G_\varphi$ constructed in Theorem 5 is triangle-free; in particular, $G_\varphi$ is bipartite. To see 
this, note that each $A_i$ is bipartite, each $B_{i,j}$ is bipartite, and for each of these subgraphs, its set of shared 
edges comes from only one side of its bipartition. □ 

While we have given a complete description above, the construction of $G_\varphi$ is perhaps better understood 
through examples. We have provided two examples in Appendix B. 

Corollary 1. Edge Partition Into Triangles and 4-Stars is NP-Complete. 

Corollary 2. Binary 3-Anonymity is NP-Complete. 

Proof. Aggarwal et al. [7] showed that there is a polynomial time reduction from Edge Partition Into 
Triangles And 4-Stars to Binary 3-Anonymity. Their reduction is repeated in Appendix C for 
completeness. □ 

6 Hardness of Computing $\ell$-diversity 

Finally, we consider an alternative privacy model called $\ell$-diversity, which strengthens the privacy guarantees 
of the k-anonymity model. It was first proposed to prevent certain background knowledge attacks which could 
potentially be used against a k-anonymized dataset [5]. In the model, we distinguish between which attributes 
of the database are merely potentially identifying and which are highly sensitive. Those which are highly 
sensitive require a strong privacy guarantee. 

Definition 13. The cost of an $\ell$-diverse solution is the number of stars introduced, among the attributes 
$q \in Q$, to the database. 

The fact that optimal 3-diversity with binary attributes and one sensitive ternary attribute is NP-hard 
should not be too surprising, in light of our proofs of hardness for 3-anonymity. Intuitively, the extra sensitive 
attribute constraint should make 3-diversity only harder than 3-anonymity. What is perhaps surprising is 
that optimal 2-diversity is NP-hard for databases with three sensitive attributes per row, in light of our result 
that optimal 2-anonymity is in P. 

Reminder of Theorem 6. Optimal 2-diversity with binary attributes and three sensitive attributes is 
NP-hard. 

Proof. The reduction is from Edge Partition Into Triangles, which is known to be NP-complete even when the 
graph is tripartite [16]. The idea for the reduction is similar to the reductions in [7] for binary k-anonymity 
(see Appendix C). Given a graph $G = (V,E)$, define a 2-diversity instance as follows: the rows of the table 
correspond to the edges $e \in E$, while the columns correspond to the $n = |V|$ vertices of G plus the sensitive 
attributes $s^0, s^1, s^2$. Given an arbitrary ordering of the vertices $V = \{v_1, \ldots, v_n\}$ and edges $E = \{e_1, \ldots, e_m\}$, 
define a matrix $R$ as follows: 



[ 1 otherwise; 

The cost of grouping any three rows in a 2-diverse solution is at least three stars per row because the graph 
is simple. Furthermore, any group of more than three rows will require more than three stars per row to 
2-diversify. The argument is identical to that of Lemma 6 in Appendix C. 




Let $V_0, V_1, V_2$ be the tripartition of vertices in G. Now label: 



if $e_i \cap V_j = \emptyset$; 
Lemma 5. Any group of only two rows in $R$ violates the 2-diversity constraint. 

Proof. Let i, j be any pair of distinct rows. Because the graph is tripartite, either $s_i^0 = s_j^0$ or $s_i^1 = s_j^1$ or 
else $s_i^2 = s_j^2$. The diversity constraints for two rows will look like: 





[Table: rows $e_i$ and $e_j$; column groups Q and S] 



□ 

Similarly, the diversity constraints coupled with the fact that the graph is tripartite also prevent us from 
choosing three rows corresponding to a 4-star in G, because the rows would share a sensitive attribute value. 
However, the diversity constraints do allow for the possibility that the three rows correspond to a triangle 
in G, as illustrated in the table: 





[Table: rows $e_i$, $e_j$ and $e_k$ forming a triangle; column groups Q and S] 



Thus the edges of G can be partitioned into triangles iff the 2-diversity instance has a solution that 
introduces exactly 3 stars per row. □ 
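The role of the three sensitive attributes can be illustrated with a small check. This is a sketch under stated assumptions rather than the paper's exact encoding: it assumes attribute $s^j$ of an edge is 1 when the edge has no endpoint in $V_j$ and 0 otherwise, and it assumes a group is 2-diverse when every sensitive attribute takes at least two distinct values inside the group. The function names and the example graph are hypothetical.

```python
def sensitive_labels(edge, part_of):
    """Assumed encoding: s^j = 1 if the edge misses part V_j, else 0 (j = 0, 1, 2)."""
    touched = {part_of[v] for v in edge}
    return tuple(0 if j in touched else 1 for j in range(3))

def is_l_diverse(label_rows, l):
    """Assumed definition: each sensitive attribute has at least l distinct values."""
    return all(len(set(column)) >= l for column in zip(*label_rows))

# A small tripartite graph: a in V_0; b, d in V_1; c in V_2.
part_of = {"a": 0, "b": 1, "c": 2, "d": 1}

triangle = [("a", "b"), ("a", "c"), ("b", "c")]
four_star = [("a", "b"), ("a", "c"), ("a", "d")]   # centered at a
pair = [("a", "b"), ("a", "c")]

for name, group in [("triangle", triangle), ("4-star", four_star), ("pair", pair)]:
    labels = [sensitive_labels(e, part_of) for e in group]
    print(name, labels, is_l_diverse(labels, 2))
```

Under these assumptions only the triangle passes: every edge of a 4-star touches the part containing its center, so that attribute is constant within the group, and any two edges of a tripartite graph touch at least one common part.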



Reminder of Theorem 7. Optimal 3-diversity with binary attributes is NP-hard, with only one sensitive 
ternary attribute. 

Proof. The hardness reduction for 3-diversity with one sensitive attribute is essentially the same as above. 
Assume that G is tripartite, and let $V_1, V_2, V_3$ be the three partite sets in G. Let $s_i$ denote the sensitive 
attribute for row $i$. If 3-diversity is to be feasible then the sensitive attribute $s_i$ must be allowed to take at 
least three values. Other attributes must be binary. 

$$s_i = \begin{cases} 1 & \text{if } e_i = (x, y), \text{ with } x \in V_1,\ y \in V_2; \\ 2 & \text{if } e_i = (x, z), \text{ with } x \in V_1,\ z \in V_3; \\ 3 & \text{if } e_i = (y, z), \text{ with } y \in V_2,\ z \in V_3. \end{cases}$$ 

As before, the diversity constraints now prevent us from grouping three rows which correspond to a 
4-star. Groups of rows which do not correspond to a triangle in G still require more than three stars per 
row. Thus the edges of G can be partitioned into triangles iff the 3-diversity instance has a solution that 
introduces exactly 3 stars per row. □ 
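The effect of the single ternary attribute can be sketched in the same way. The labeling below follows the case definition of $s_i$ given in the proof; the diversity test (at least three distinct sensitive values in a group) is an assumed reading of the 3-diversity constraint, and the function names and example vertices are hypothetical.

```python
def ternary_label(edge, part_of):
    """Label an edge by the pair of partite sets it joins: 1, 2 or 3."""
    parts = frozenset(part_of[v] for v in edge)
    table = {
        frozenset({1, 2}): 1,   # x in V_1, y in V_2
        frozenset({1, 3}): 2,   # x in V_1, z in V_3
        frozenset({2, 3}): 3,   # y in V_2, z in V_3
    }
    return table[parts]

def is_3_diverse(labels):
    """Assumed definition: the group contains at least three distinct labels."""
    return len(set(labels)) >= 3

# x in V_1; y, y2 in V_2; z in V_3.
part_of = {"x": 1, "y": 2, "y2": 2, "z": 3}

triangle = [("x", "y"), ("x", "z"), ("y", "z")]
four_star = [("x", "y"), ("x", "z"), ("x", "y2")]   # centered at x

print([ternary_label(e, part_of) for e in triangle])
print([ternary_label(e, part_of) for e in four_star])
```

A triangle in a tripartite graph joins each of the three pairs of parts once, so its labels are all different; the edges of a 4-star all touch the part of the center, so at most two labels can occur.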



7 Conclusion 

We have demonstrated the hardness and feasibility of several methods used in database privacy, settling 
several open problems on the topic. The upshot is that most of these problems are difficult to solve optimally, 
even in very special cases; however, in some interesting cases these problems can be solved faster. Several 
interesting open questions address possible ways around this intractability: 

- To what degree can the hard problems be approximately solved? For example, the best known 
approximation algorithm for k-anonymity, given by Park and Shim [17], suppresses no more than O(log k) 
times the optimal number of entries. Could better approximation ratios be achieved when the number 
of attributes is small? 
- The best known running time for Simplex Matching is 0{n^ + n^m?) steps [9]. Here, n is the number 
of nodes and m is the number of hyperedges in the hypergraph. In our algorithm for 2-anonymity, n is 
also the number of rows in the database while $m = \binom{n}{3} = O(n^3)$ because we add a hyperedge for every 
triple. Hence our algorithm for 2-Anonymity has running time 0(n^). Can this exponent be reduced to 
a more practical running time? 
A Proof of Lemma 1 

Recall we had the following three properties of the database D in the reduction of Theorem 2: 
1. For any x rows $r_i, r_j, r_k, r_l, \ldots$, where $x \ge 4$, the cost of anonymizing these rows is 

$C_{i,j,k,l} = x \cdot 27$ 

because there is no alphabet symbol that is used in all 4 rows. 

2. If $\{r_i, r_j, r_k\} \notin M$ then the cost of anonymizing the three corresponding rows is 

$C_{i,j,k} = 3 \cdot 27$ 

because there is no alphabet symbol that is used in all 3 rows. 

3. If $\{r_i, r_j, r_k\} \in M$ then the cost of anonymizing the three corresponding rows is 

$C_{i,j,k} = 3 \cdot 26$ 

because the three rows will match in exactly one of the 27 columns. 
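The three properties determine the total suppression cost of a solution once the number of matched triples is known. The sketch below is illustrative only (the function name and the sample numbers are assumptions): it charges 26 stars per row for rows grouped as a triple from M and 27 stars per row for every other row, and confirms that this equals $27n - 3|M'|$.

```python
def anonymization_cost(n_rows, matched_triples):
    """Total stars when `matched_triples` disjoint triples from M are used.

    Rows in a matched triple keep one column (26 stars each, property 3);
    all remaining rows are suppressed entirely (27 stars each, properties 1 and 2).
    """
    unmatched_rows = n_rows - 3 * matched_triples
    if unmatched_rows < 0:
        raise ValueError("more matched rows than rows in the table")
    return matched_triples * 3 * 26 + unmatched_rows * 27

n, matched = 9, 2
print(anonymization_cost(n, matched))   # compare with 27 * n - 3 * matched
print(27 * n - 3 * matched)
```

Each matched triple saves exactly three stars relative to full suppression, which is where the term $3|M'|$ comes from.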

Reminder of Lemma 1. There is a polynomial time mapping g from 3DM-3 feasible solutions to 
3-anonymity feasible solutions, such that if $M' \subseteq M$ is a 3DM-3 feasible solution then $C_{3DM}(M') = 27\,C_{3ANON}(g(M'))$. 

Proof. We use the reduction defined in the proof of Theorem 2. Recall that a 3-anonymity solution P is a 
partition of the rows into groups of size 3, 4 and 5. By the three properties of D, any group which does not 
correspond to a triple from M must be suppressed entirely. Hence, we can think of the solution as a partition 
of the rows into triples $(x_i, y_j, w_k)$ from M and some other rows. Similarly, we can think of a 3DM solution 
as a partition of the elements into triples from M and some other elements. Thus we can define a polynomial 
time computable transformation g between 3DM-3 solutions and 3-anonymity solutions. 
By the above properties of D, $\mathrm{Cost}(g(M')) = 27n - 3|M'|$. Therefore, 

$$C_{3DM}(M') = 27 - \frac{3|M'|}{n} = \frac{27n - 3|M'|}{n} = \frac{\mathrm{Cost}(g(M'))}{n} = 27 \cdot C_{3ANON}(g(M')).$$ 

□ 

B Edge Partition into 4-Star Reduction - Examples 

Figure 1 shows examples of a variable gadget and a clause gadget. 

Example 1. $\varphi$ = (x, y, z)(x, y, z). Note that the 1-in-3 SAT formula has two satisfying assignments: (x = t, 
y = t, z = f), (x = f, y = f, z = f). Similarly, the corresponding graph (shown in Figure 2) can be 
partitioned into 4-stars in exactly two ways, both corresponding to the satisfying assignments. 

Example 2. $\varphi$ = (x, y, z)(x, y, z). Note that $\varphi$ is not 1-in-3 satisfiable. Similarly, the corresponding graph 
(shown in Figure 3) cannot be edge-partitioned into 4-stars. 
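Small 1-in-3 SAT instances such as these can be checked exhaustively. The sketch below is an illustration only: the two formulas are assumptions chosen to behave like the examples (one with exactly two satisfying assignments, one with none) and are not claimed to be the formulas drawn in Figures 2 and 3. A literal is written as a pair (variable, polarity), and `one_in_three_assignments` is a hypothetical helper name.

```python
from itertools import product

def one_in_three_assignments(clauses, variables):
    """Return all assignments making exactly one literal true in every clause."""
    solutions = []
    for values in product([True, False], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        # A literal (v, polarity) is true when the variable's value equals polarity.
        if all(sum(assignment[v] == pol for v, pol in clause) == 1
               for clause in clauses):
            solutions.append(assignment)
    return solutions

variables = ["x", "y", "z"]

# Illustrative formula (x, not y, z)(not x, y, z).
phi_a = [(("x", True), ("y", False), ("z", True)),
         (("x", False), ("y", True), ("z", True))]

# Illustrative formula (x, y, z)(not x, y, z).
phi_b = [(("x", True), ("y", True), ("z", True)),
         (("x", False), ("y", True), ("z", True))]

print(one_in_three_assignments(phi_a, variables))
print(one_in_three_assignments(phi_b, variables))
```

By the reduction, each satisfying assignment found this way corresponds to a true/false partition of the variable gadgets, and an empty result corresponds to a graph with no edge partition into 4-stars.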
Fig. 1. Gadgets. (a) Example gadget: $G_3$. The private edges are solid and the shared edges are dashed. The 
selection of the three deleted leaves is arbitrary. (b) $S_5$: 4-Star Clause Gadget. 



C Reducing Edge Partition Into Triangles And 4-Stars to Binary 3-Anonymity 

Given a graph $G = (V, E)$ with m edges and n vertices, build the following table: the rows of the table 
correspond to the edges $e \in E$, while the columns correspond to the $n = |V|$ vertices of G. Given an 
arbitrary ordering of the vertices $V = \{v_1, \ldots, v_n\}$ and edges $E = \{e_1, \ldots, e_m\}$, define a database $R$ as follows: 
$R[i][j] = 1$ if $v_j$ is an endpoint of $e_i$, and $R[i][j] = 0$ otherwise. 



Clearly this reduction takes polynomial time. Note that, because the graph G is simple, any 3-anonymous 
solution must include at least three stars per row. This follows because for any set of three edges, there are 
at least three vertices that are incident with at least one, but not all, of the three edges. Furthermore, note that if 
a set of three edges does not form a triangle or 4-star, then there are at least four vertices that are incident 
with at least one (but not all) of the three edges. The result follows from Lemma 6. 

Lemma 6. Let m be the number of edges in G. The cost of the optimal 3-anonymous solution for $R$ is 3m 
stars iff the graph G can be edge partitioned into 4-stars and triangles. 

Proof. First, suppose that the cost of the optimal 3-anonymity solution is 3m. Since each row has at least 
3 stars in it, each row must have exactly three stars because there are m rows in $R$. Given a set of three 
identical (anonymized) rows, each row has 3 stars, each corresponding to a vertex that was incident to at least one, 
but not all, of the three edges represented by those rows. Hence, those three edges form either a 4-star or a 
triangle. Therefore, the edges of the graph can be partitioned into triangles and 4-stars. 

Fig. 2. $\varphi$ = (x,y,z)(x,y,z) 



For the other direction, suppose that G can be partitioned into triangles and 4-stars. Group the rows of 
the 3-anonymity instance according to the partition. Now consider a group of three rows in the table that 
correspond to the edges of a 4-star. The three rows have the form: 





(v_0, v_1)   ···1100··· 
(v_0, v_2)   ···1010··· 
(v_0, v_3)   ···1001··· 

where the ··· are all 0's. For a triangle, the three rows corresponding to its edges look like: 

(v_0, v_1)   ···110··· 
(v_0, v_2)   ···101··· 
(v_1, v_2)   ···011··· 



where again the • • • are all O's. Clearly, both groups of rows can be made identical by suppressing only 
three entries per row. Hence, the table can be made 3-anonymous with 3m stars. □ 
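The star counts used in this proof can be reproduced mechanically. The following is a minimal sketch with illustrative names (`incidence_rows`, `stars_per_row`), assuming the table entry is 1 exactly when the vertex is an endpoint of the edge: a column must be suppressed in every row of a group whenever the rows disagree on it, so the number of stars per row is the number of disagreeing columns.

```python
def incidence_rows(vertices, edges):
    """One 0/1 row per edge: entry is 1 iff the vertex is an endpoint of the edge."""
    return [[1 if v in e else 0 for v in vertices] for e in edges]

def stars_per_row(rows):
    """Columns on which the group disagrees must be starred in every row."""
    return sum(1 for column in zip(*rows) if len(set(column)) > 1)

vertices = ["v0", "v1", "v2", "v3"]
groups = {
    "triangle": [("v0", "v1"), ("v0", "v2"), ("v1", "v2")],
    "4-star":   [("v0", "v1"), ("v0", "v2"), ("v0", "v3")],
    "path":     [("v0", "v1"), ("v1", "v2"), ("v2", "v3")],
}
for name, edges in groups.items():
    print(name, stars_per_row(incidence_rows(vertices, edges)))
```

The triangle and the 4-star each need three stars per row, while the path, which is neither, needs four, in line with the observation that only triangles and 4-stars reach the lower bound.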

D Edge Partitioning $G_d$ into 4-Stars 

Recall the statement of Lemma 4. 

Lemma 7. Let G be a graph containing a share-respecting copy of $G_d$. Assuming there is an edge partition 
of G into 4-stars, exactly one of two cases must hold for that partition: 

1. All shared edges belonging to the top tree of $G_d$ are contained in 4-stars with centers in $G_d$, while all 
shared edges belonging to the bottom tree are contained in 4-stars with centers not contained in $G_d$. 

2. All shared edges belonging to the bottom tree of $G_d$ are contained in 4-stars with centers in $G_d$, while all 
shared edges belonging to the top tree are contained in 4-stars with centers not contained in $G_d$. 

Fig. 3. $\varphi$ = (x,y,z)(x,y,z) 

Proof. Let G be a graph which contains a share-respecting copy of $G_d$ along with an edge partition P of 
G into 4-stars. Note that every vertex in $G_d$ has degree 3; even the leaves in $G_d$ have two more shared 
edges in $G_d$. If an internal vertex x in a 3-Binary Tree is the center of a 4-star in P (see Figure 4) then its 
parent y cannot be the center of any 4-star in P because its degree has been reduced to 2. This means that 
y's parent z must be the center of a 4-star in P to cover the edge (y, z). Similarly, if x was not the center of a 4-star 
then y must be the center of a 4-star to cover the edge (x, y). 

Now the pattern becomes evident: the parent of z cannot be the center of a 4-star so z's grandparent 
must be the center of a 4-star, and so on. Therefore, in any edge partition, if there is a 4-star centered at 
vertex v at depth i, then the ancestors of v at depths i − 2, i − 4, ... as well as the descendants at depths 
i + 2, i + 4, ... must all be centers of 4-stars as well. 

Consider the root of the 3-Binary Tree at depth 0. There are only two possible scenarios. Scenario 1: the 
root is the center of a 4-star and all the vertices (descendants) at depths 0, 2, 4, ... in that 3-Binary Tree 
must also be the centers of 4-stars. Scenario 2: the root is not the center of a 4-star and all the vertices 
(descendants) at depths 1, 3, ... must be centers of 4-stars. 

By construction of $G_d$ there must be exactly three edges between the top and bottom 3-Binary Trees in 
$G_d$. Pick one such edge (u, v). Any edge partition of G must use the edge (u, v), so either u or v must be 
the center of a 4-star. Without loss of generality assume that u is the center of a 4-star and that u is in the 
bottom 3-Binary Tree. Notice that both u and v are at depth d − 1 in their respective trees. Assume that 
d is odd (the proof is similar for d even); then in the bottom tree we are in Scenario 1, but in the top tree 
we are in Scenario 2. All shared edges belonging to the bottom tree of $G_d$ are contained in 4-stars centered 
at depth d − 1, while no shared edges from the top tree of $G_d$ can be contained in 4-stars centered at depth 
d − 1. □ 

Fig. 4. x is the center of a 4-star, therefore y cannot be the center of a 4-star. In any partition the edge (y, z) must 
be covered by a 4-star centered at z. 